Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion
Authors
Abstract
Because of the increasing gap between the speeds of processors and main memories, compilers must enhance the locality of applications to achieve high performance. Loop fusion enhances locality by fusing loops that access similar sets of data. Typically, it is applied to loops at the same level after loop interchange, which first attains the best nesting order for each local loop nest. However, since loop interchange cannot foresee the overall optimization effect, it often selects the wrong loops to be placed outermost for fusion, achieving suboptimal performance globally. Building on traditional unimodular transformations on perfectly nested loops, we present a novel transformation, dependence hoisting, that effectively combines interchange and fusion for arbitrarily nested loops. We present techniques to simultaneously interchange and fuse loops at multiple levels. By evaluating the compound optimization effect beforehand, we have achieved better performance than was possible with previous techniques, which apply interchange and fusion separately.
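To make the two transformations concrete, here is a minimal C sketch of loop interchange followed by fusion at two levels. It is an illustration only and does not implement the paper's dependence hoisting. The array size N and the function names unfused, interchanged_and_fused, and check_equivalence are assumptions introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

#define N 64  /* hypothetical problem size, chosen only for illustration */

/* Original form: two separate loop nests. The second nest walks the arrays
   column by column, so it has poor spatial locality in row-major C arrays,
   and a[] is traversed in full twice. */
void unfused(double a[N][N], double b[N][N], double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = b[i][j] * 2.0;

    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            c[i][j] = a[i][j] + 1.0;
}

/* Transformed form: the second nest is interchanged to i-outer, j-inner,
   and the two nests are then fused at both the i and the j level.
   This is legal here because c[i][j] reads only a[i][j], which is
   produced earlier in the same fused iteration. */
void interchanged_and_fused(double a[N][N], double b[N][N], double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            a[i][j] = b[i][j] * 2.0;
            c[i][j] = a[i][j] + 1.0;  /* reuses a[i][j] while it is still hot */
        }
}

/* Runs both variants on the same input and checks that they agree. */
void check_equivalence(void) {
    static double b[N][N], a1[N][N], c1[N][N], a2[N][N], c2[N][N];

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            b[i][j] = (double)(i * N + j);

    unfused(a1, b, c1);
    interchanged_and_fused(a2, b, c2);

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            assert(a1[i][j] == a2[i][j]);
            assert(c1[i][j] == c2[i][j]);
        }
}
```

In this sketch the interchange of the second nest is chosen with the fusion in mind. An interchange step that only looks at one nest at a time has no way to know which nesting order will make the later fusion possible, which is the limitation the abstract describes.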
Similar resources
Set Associative Cache Behavior Optimization
One of the most important issues related to program performance is the memory hierarchy behavior. Programmers try nowadays to optimize this behavior intuitively or using costly techniques such as trace-driven simulations through a trial and error process. A systematic modeling strategy that allows an automated analysis of the memory hierarchy performance is developed in this work. This approach...
Multilevel Blocking in Complex Iteration Spaces
This paper presents a new unified method for simultaneously tiling the register and cache levels of the memory hierarchy. We will focus on the code transformation phase of tiling. Our algorithm uses strip-mining and loop interchange on all memory hierarchy levels to determine the tiles as usual, and, afterwards, and due to the special characteristics of the register level, we apply index set sp...
A Cache-Conscious Profitability Model for Empirical Tuning of Loop Fusion
Loop fusion is recognized as an effective program transformation for improving memory hierarchy performance. However, unconstrained loop fusion can lead to poor performance because of increased register pressure and cache conflict misses. The complex interaction between different levels of the memory hierarchy with the input program makes it very difficult to always make the right choice in fus...
Automatic selection of high-order transformations in the IBM XL FORTRAN compilers
The IBM ASTI optimizer provides the foundation for high-order transformations and automatic shared-memory parallelization in the latest IBM XL FORTRAN (XLF) compilers for RS/6000 and PowerPC uniprocessors and symmetric multiprocessors (SMPs), and for automatic distributed-memory parallelization in the IBM XL High-Performance FORTRAN (XLHPF) compiler for the SP distributed-memory multiproc...
Wave Equation Based Stencil Optimizations on Multi-core CPU
As the engine for seismic imaging algorithms, stencil kernels modeling wave propagation are both compute- and memory-intensive. This work targets improving the performance of wave equation based stencil code parallelized by OpenMP on a multi-core CPU. To achieve this goal, we explored two techniques: improving vectorization by using hardware SIMD technology, and reducing memory traffic to mitigate...
Journal title:
- IJHPCA
Volume 18, issue
Pages -
Publication date: 2004